Exploration server on the visibility of Le Havre

Warning: this site is under development!
Warning: this site was generated by automated means from raw corpora.
The information has therefore not been validated.

Adaptative quality control of digital documents in mass digitization projects

Internal identifier: 000244 (Main/Exploration); previous: 000243; next: 000245

Authors: Ahmed Ben Salah [France]

Source:

RBID : Hal:tel-01164698

French descriptors

English descriptors

Abstract

This work focuses on assessing the character recognition results produced automatically by optical character recognition (OCR) software in mass digitization projects. The goal is to design a global control system robust enough to handle the BnF document collection. This collection includes old documents that are difficult for OCR to process. We designed a word detection system to detect missed-word defects in OCR results, and a word recognition rate estimator to assess the quality of the word recognition results produced by OCR. We create two kinds of descriptors to characterize OCR outputs: image descriptors to characterize page segmentation results, and cross-alignment descriptors to characterize the quality of word recognition results. Furthermore, we adapt our learning process to build adaptive decision or prediction systems. We evaluated our control systems on real images selected randomly from the BnF collection. The missed-word detection system detects 84.15% of the words omitted by the OCR with a precision of 94.73%. The experiments also showed that 80% of the documents with a word recognition rate below 98% are detected with an accuracy of 92%. The system can also automatically detect 45% of the material with a recognition rate below 70% with greater than 92% accuracy.

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Adaptative quality control of digital documents in mass digitization projects</title>
<title xml:lang="fr">Maîtrise de la qualité des transcriptions numériques dans les projets de numérisation de masse</title>
<author>
<name sortKey="Ben Salah, Ahmed" sort="Ben Salah, Ahmed" uniqKey="Ben Salah A" first="Ahmed" last="Ben Salah">Ahmed Ben Salah</name>
<affiliation wicri:level="1">
<hal:affiliation type="researchteam" xml:id="struct-399419" status="INCOMING">
<orgName>DocApp et Rfai</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-23832" type="direct"></relation>
<relation active="#struct-300317" type="indirect"></relation>
<relation name="EA4108" active="#struct-300318" type="indirect"></relation>
<relation active="#struct-301288" type="indirect"></relation>
<relation active="#struct-301232" type="indirect"></relation>
<relation active="#struct-203066" type="direct"></relation>
<relation active="#struct-302209" type="indirect"></relation>
<relation active="#struct-204893" type="direct"></relation>
<relation name="EA6300" active="#struct-300298" type="indirect"></relation>
<relation active="#struct-300408" type="indirect"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-23832" type="direct">
<org type="laboratory" xml:id="struct-23832" status="VALID">
<orgName>Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes</orgName>
<orgName type="acronym">LITIS</orgName>
<desc>
<address>
<addrLine>Avenue de l'Université UFR des Sciences et Techniques 76800 Saint-Etienne du Rouvray</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.litislab.eu</ref>
</desc>
<listRelation>
<relation active="#struct-300317" type="direct"></relation>
<relation name="EA4108" active="#struct-300318" type="direct"></relation>
<relation active="#struct-301288" type="direct"></relation>
<relation active="#struct-301232" type="indirect"></relation>
</listRelation>
</org>
</tutelle>
<tutelle active="#struct-300317" type="indirect">
<org type="institution" xml:id="struct-300317" status="VALID">
<orgName>Université du Havre</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle name="EA4108" active="#struct-300318" type="indirect">
<org type="institution" xml:id="struct-300318" status="VALID">
<orgName>Université de Rouen</orgName>
<desc>
<address>
<addrLine> 1 rue Thomas Becket - 76821 Mont-Saint-Aignan</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-rouen.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-301288" type="indirect">
<org type="department" xml:id="struct-301288" status="VALID">
<orgName>Institut National des Sciences Appliquées - Rouen</orgName>
<orgName type="acronym">INSA Rouen</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-301232" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle active="#struct-301232" type="indirect">
<org type="institution" xml:id="struct-301232" status="VALID">
<orgName>Institut National des Sciences Appliquées</orgName>
<orgName type="acronym">INSA</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-203066" type="direct">
<org type="laboratory" xml:id="struct-203066" status="VALID">
<orgName>Bibliothèque nationale de France, Délégation à la Stratégie et à la recherche</orgName>
<orgName type="acronym">BnF_DSG</orgName>
<desc>
<address>
<addrLine>Quai François Mauriac, 75706 Paris cedex 13</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.bnf.fr/fr/la_bnf/strategie_recherche.html</ref>
</desc>
<listRelation>
<relation active="#struct-302209" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle active="#struct-302209" type="indirect">
<org type="institution" xml:id="struct-302209" status="VALID">
<orgName>Bibliothèque Nationale de France</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-204893" type="direct">
<org type="laboratory" xml:id="struct-204893" status="VALID">
<orgName>Laboratoire d'Informatique de l'Université de Tours</orgName>
<orgName type="acronym">LI</orgName>
<desc>
<address>
<addrLine>64, Avenue Jean Portalis, 37200 Tours</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.li.univ-tours.fr/</ref>
</desc>
<listRelation>
<relation name="EA6300" active="#struct-300298" type="direct"></relation>
<relation active="#struct-300408" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle name="EA6300" active="#struct-300298" type="indirect">
<org type="institution" xml:id="struct-300298" status="VALID">
<orgName>Université François Rabelais - Tours</orgName>
<desc>
<address>
<addrLine>60 rue du Plat d'Étain, 37020 Tours cedex 1 </addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-tours.fr</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300408" type="indirect">
<org type="institution" xml:id="struct-300408" status="VALID">
<orgName>Polytech'Tours</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">Le Havre</settlement>
<region type="region" nuts="2">Région Normandie</region>
<region type="old region" nuts="2">Haute-Normandie</region>
</placeName>
<orgName type="university">Université du Havre</orgName>
<placeName>
<settlement type="city">Rouen</settlement>
<region type="region" nuts="2">Région Normandie</region>
<region type="old region" nuts="2">Haute-Normandie</region>
</placeName>
<orgName type="university">Université de Rouen</orgName>
<placeName>
<settlement type="city">Tours</settlement>
<region type="old region" nuts="2">Région Centre</region>
<region type="region" nuts="2">Centre-Val de Loire</region>
</placeName>
<orgName type="university">Université François-Rabelais de Tours</orgName>
<orgName type="institution" wicri:auto="newGroup">Centre Val de Loire Université</orgName>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:tel-01164698</idno>
<idno type="halId">tel-01164698</idno>
<idno type="halUri">https://hal-bnf.archives-ouvertes.fr/tel-01164698</idno>
<idno type="url">https://hal-bnf.archives-ouvertes.fr/tel-01164698</idno>
<date when="2014-07-11">2014-07-11</date>
<idno type="wicri:Area/Hal/Corpus">000027</idno>
<idno type="wicri:Area/Hal/Curation">000027</idno>
<idno type="wicri:Area/Hal/Checkpoint">000120</idno>
<idno type="wicri:Area/Main/Merge">000245</idno>
<idno type="wicri:Area/Main/Curation">000244</idno>
<idno type="wicri:Area/Main/Exploration">000244</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">Adaptative quality control of digital documents in mass digitization projects</title>
<title xml:lang="fr">Maîtrise de la qualité des transcriptions numériques dans les projets de numérisation de masse</title>
<author>
<name sortKey="Ben Salah, Ahmed" sort="Ben Salah, Ahmed" uniqKey="Ben Salah A" first="Ahmed" last="Ben Salah">Ahmed Ben Salah</name>
<affiliation wicri:level="1">
<hal:affiliation type="researchteam" xml:id="struct-399419" status="INCOMING">
<orgName>DocApp et Rfai</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-23832" type="direct"></relation>
<relation active="#struct-300317" type="indirect"></relation>
<relation name="EA4108" active="#struct-300318" type="indirect"></relation>
<relation active="#struct-301288" type="indirect"></relation>
<relation active="#struct-301232" type="indirect"></relation>
<relation active="#struct-203066" type="direct"></relation>
<relation active="#struct-302209" type="indirect"></relation>
<relation active="#struct-204893" type="direct"></relation>
<relation name="EA6300" active="#struct-300298" type="indirect"></relation>
<relation active="#struct-300408" type="indirect"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-23832" type="direct">
<org type="laboratory" xml:id="struct-23832" status="VALID">
<orgName>Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes</orgName>
<orgName type="acronym">LITIS</orgName>
<desc>
<address>
<addrLine>Avenue de l'Université UFR des Sciences et Techniques 76800 Saint-Etienne du Rouvray</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.litislab.eu</ref>
</desc>
<listRelation>
<relation active="#struct-300317" type="direct"></relation>
<relation name="EA4108" active="#struct-300318" type="direct"></relation>
<relation active="#struct-301288" type="direct"></relation>
<relation active="#struct-301232" type="indirect"></relation>
</listRelation>
</org>
</tutelle>
<tutelle active="#struct-300317" type="indirect">
<org type="institution" xml:id="struct-300317" status="VALID">
<orgName>Université du Havre</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle name="EA4108" active="#struct-300318" type="indirect">
<org type="institution" xml:id="struct-300318" status="VALID">
<orgName>Université de Rouen</orgName>
<desc>
<address>
<addrLine> 1 rue Thomas Becket - 76821 Mont-Saint-Aignan</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-rouen.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-301288" type="indirect">
<org type="department" xml:id="struct-301288" status="VALID">
<orgName>Institut National des Sciences Appliquées - Rouen</orgName>
<orgName type="acronym">INSA Rouen</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-301232" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle active="#struct-301232" type="indirect">
<org type="institution" xml:id="struct-301232" status="VALID">
<orgName>Institut National des Sciences Appliquées</orgName>
<orgName type="acronym">INSA</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-203066" type="direct">
<org type="laboratory" xml:id="struct-203066" status="VALID">
<orgName>Bibliothèque nationale de France, Délégation à la Stratégie et à la recherche</orgName>
<orgName type="acronym">BnF_DSG</orgName>
<desc>
<address>
<addrLine>Quai François Mauriac, 75706 Paris cedex 13</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.bnf.fr/fr/la_bnf/strategie_recherche.html</ref>
</desc>
<listRelation>
<relation active="#struct-302209" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle active="#struct-302209" type="indirect">
<org type="institution" xml:id="struct-302209" status="VALID">
<orgName>Bibliothèque Nationale de France</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-204893" type="direct">
<org type="laboratory" xml:id="struct-204893" status="VALID">
<orgName>Laboratoire d'Informatique de l'Université de Tours</orgName>
<orgName type="acronym">LI</orgName>
<desc>
<address>
<addrLine>64, Avenue Jean Portalis, 37200 Tours</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.li.univ-tours.fr/</ref>
</desc>
<listRelation>
<relation name="EA6300" active="#struct-300298" type="direct"></relation>
<relation active="#struct-300408" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle name="EA6300" active="#struct-300298" type="indirect">
<org type="institution" xml:id="struct-300298" status="VALID">
<orgName>Université François Rabelais - Tours</orgName>
<desc>
<address>
<addrLine>60 rue du Plat d'Étain, 37020 Tours cedex 1 </addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-tours.fr</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300408" type="indirect">
<org type="institution" xml:id="struct-300408" status="VALID">
<orgName>Polytech'Tours</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">Le Havre</settlement>
<region type="region" nuts="2">Région Normandie</region>
<region type="old region" nuts="2">Haute-Normandie</region>
</placeName>
<orgName type="university">Université du Havre</orgName>
<placeName>
<settlement type="city">Rouen</settlement>
<region type="region" nuts="2">Région Normandie</region>
<region type="old region" nuts="2">Haute-Normandie</region>
</placeName>
<orgName type="university">Université de Rouen</orgName>
<placeName>
<settlement type="city">Tours</settlement>
<region type="old region" nuts="2">Région Centre</region>
<region type="region" nuts="2">Centre-Val de Loire</region>
</placeName>
<orgName type="university">Université François-Rabelais de Tours</orgName>
<orgName type="institution" wicri:auto="newGroup">Centre Val de Loire Université</orgName>
</affiliation>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="mix" xml:lang="en">
<term>Optical Character Recognition</term>
<term>Quality Assessment</term>
<term>Segmentation defects</term>
<term>Texture Characterization</term>
</keywords>
<keywords scheme="mix" xml:lang="fr">
<term>Analyse de texture</term>
<term>Classification</term>
<term>Erreur de segmentation</term>
<term>Prédiction de performances</term>
<term>Reconnaissance de caractères</term>
<term>Reconnaissance optique de caractères</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr">
<term>Classification</term>
<term>Reconnaissance optique de caractères</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">This work focuses on assessing the character recognition results produced automatically by optical character recognition (OCR) software in mass digitization projects. The goal is to design a global control system robust enough to handle the BnF document collection. This collection includes old documents that are difficult for OCR to process. We designed a word detection system to detect missed-word defects in OCR results, and a word recognition rate estimator to assess the quality of the word recognition results produced by OCR. We create two kinds of descriptors to characterize OCR outputs: image descriptors to characterize page segmentation results, and cross-alignment descriptors to characterize the quality of word recognition results. Furthermore, we adapt our learning process to build adaptive decision or prediction systems. We evaluated our control systems on real images selected randomly from the BnF collection. The missed-word detection system detects 84.15% of the words omitted by the OCR with a precision of 94.73%. The experiments also showed that 80% of the documents with a word recognition rate below 98% are detected with an accuracy of 92%. The system can also automatically detect 45% of the material with a recognition rate below 70% with greater than 92% accuracy.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>France</li>
</country>
<region>
<li>Centre-Val de Loire</li>
<li>Haute-Normandie</li>
<li>Région Centre</li>
<li>Région Normandie</li>
</region>
<settlement>
<li>Le Havre</li>
<li>Rouen</li>
<li>Tours</li>
</settlement>
<orgName>
<li>Centre Val de Loire Université</li>
<li>Université François-Rabelais de Tours</li>
<li>Université de Rouen</li>
<li>Université du Havre</li>
</orgName>
</list>
<tree>
<country name="France">
<region name="Région Normandie">
<name sortKey="Ben Salah, Ahmed" sort="Ben Salah, Ahmed" uniqKey="Ben Salah A" first="Ahmed" last="Ben Salah">Ahmed Ben Salah</name>
</region>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/France/explor/LeHavreV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000244 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000244 | SxmlIndent | more
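
Where Dilib is not installed, a record that has already been extracted (for example with the commands above) can also be read with a general-purpose XML parser. The following is a minimal sketch in Python, not part of Dilib. It assumes the record is available as a string; the helper names `parse_record` and `summarize`, the trimmed `SAMPLE`, and the placeholder namespace URIs are all hypothetical. The record uses the `wicri:` and `hal:` prefixes without declaring them, so the sketch binds them to made-up URIs before parsing.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (assumptions): the record uses the wicri: and
# hal: prefixes without declaring them, so a strict parser needs a binding.
PLACEHOLDER_NS = 'xmlns:wicri="urn:example:wicri" xmlns:hal="urn:example:hal"'
# The xml: prefix is predefined; ElementTree exposes xml:lang under this key.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def parse_record(text):
    """Parse one <record> after binding the undeclared prefixes."""
    patched = text.replace("<record>", "<record " + PLACEHOLDER_NS + ">", 1)
    return ET.fromstring(patched)


def summarize(record):
    """Collect the identifiers, titles, and English keywords of a record."""
    idnos = {i.get("type"): i.text for i in record.iter("idno")}
    titles = {t.get(XML_LANG): t.text for t in record.iter("title")}
    keywords = [
        term.text
        for kw in record.iter("keywords")
        if kw.get(XML_LANG) == "en"
        for term in kw.iter("term")
    ]
    return {"idnos": idnos, "titles": titles, "keywords": keywords}


# Trimmed-down sample with the same shape as the record above (hypothetical).
SAMPLE = """<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Adaptative quality control of digital documents in mass digitization projects</title>
<author>
<affiliation wicri:level="1">
<hal:affiliation type="researchteam" xml:id="struct-399419" status="INCOMING"></hal:affiliation>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="RBID">Hal:tel-01164698</idno>
<idno type="wicri:Area/Main/Exploration">000244</idno>
</publicationStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="mix" xml:lang="en">
<term>Optical Character Recognition</term>
<term>Quality Assessment</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
</TEI>
</record>"""

summary = summarize(parse_record(SAMPLE))
print(summary["idnos"]["RBID"])
print(summary["keywords"])
```

Because titles and identifiers are keyed by language and type, the repeated title in the `titleStmt` and `analytic` blocks of the full record collapses to a single entry.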

To add a link to this page in the Wicri network

{{Explor lien
   |wiki=    Wicri/France
   |area=    LeHavreV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Hal:tel-01164698
   |texte=   Adaptative quality control of digital documents in mass digitization projects
}}

This area was generated with Dilib version V0.6.25.
Data generation: Sat Dec 3 14:37:02 2016. Site generation: Tue Mar 5 08:25:07 2024